Kuhn & Wickham 2017
"Preprocessing Tools to Create Design Matrices"
A recipe is a container for preprocessing steps to go from raw data to an analysis set.
May 15, 2018
A short introduction to recipes
The recipe for recipes
Example in which the recipe for recipes is applied
recipes package
A recipe is the specification of an intent: it separates the planning from the doing.
recipe
library(recipes)
train_set <- mtcars[1:20, c("am", "disp", "hp")]
test_set <- mtcars[21:32, c("am", "disp", "hp")]
rec <- recipe(train_set, am ~ .)
We have defined the roles here: am is the outcome, disp and hp are the predictors.
steps to the recipe
rec_with_steps <- rec %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
This only specifies the intent; nothing has been done to the data yet.
prep
prep acquires all the necessary information from the training set.
(rec_prepped <- rec_with_steps %>% prep())
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          2
##
## Training data contained 20 data points and no missing data.
##
## Operations:
##
## Centering for disp, hp [trained]
## Scaling for disp, hp [trained]
The statistics needed for centering and scaling have now been derived from the training set and are stored in the recipe.
bake
train_final <- bake(rec_prepped, train_set)
test_final <- bake(rec_prepped, test_set)
The statistics to center and scale are learned on the train_set and applied to the test_set.
head(test_final)
## # A tibble: 6 x 3
##      am   disp     hp
##   <dbl>  <dbl>  <dbl>
## 1    0. -0.883 -0.653
## 2    0.  0.652  0.230
## 3    0.  0.544  0.230
## 4    0.  0.901  1.81
## 5    0.  1.29   0.647
## 6    1. -1.20  -1.17
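To make the train/test distinction concrete, here is a minimal base R sketch of what centering and scaling with training-set statistics amounts to. It does not use recipes; the helper name scale_with_train_stats is hypothetical, and the arithmetic is only assumed to mirror what step_center and step_scale do.

```r
train_set <- mtcars[1:20, c("am", "disp", "hp")]
test_set  <- mtcars[21:32, c("am", "disp", "hp")]
predictors <- c("disp", "hp")

# Statistics are learned on the training set only
train_means <- vapply(train_set[predictors], mean, numeric(1))
train_sds   <- vapply(train_set[predictors], sd, numeric(1))

# Hypothetical helper: apply the stored training statistics to any data set
scale_with_train_stats <- function(data) {
  for (p in predictors) {
    data[[p]] <- (data[[p]] - train_means[[p]]) / train_sds[[p]]
  }
  data
}

train_manual <- scale_with_train_stats(train_set)
test_manual  <- scale_with_train_stats(test_set)
```

The training columns end up with mean 0 and standard deviation 1. The test columns generally do not, because they are transformed with statistics they did not contribute to.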
tidy
Returns information about the steps as a data frame.
tidy(rec_prepped)
## # A tibble: 2 x 5
##   number operation type   trained skip
##    <int> <chr>     <chr>  <lgl>   <lgl>
## 1      1 step      center TRUE    FALSE
## 2      2 step      scale  TRUE    FALSE
recipes?
Added the check framework together with Max.
A check does not change the data in any way; it tests assumptions and will break bake if these are not met.
rec2 <- recipe(train_set) %>%
  check_missing(everything()) %>%
  prep()
test_set[1, 1] <- NA
train_baked <- bake(rec2, train_set)
test_baked <- bake(rec2, test_set)
## Error: The following columns contain missing values: `am`.
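The contract of a check can be illustrated without recipes. The sketch below is a hypothetical stand-in (check_no_missing is not part of the package): it either returns the data untouched or stops with an informative error, which is the behaviour described above.

```r
# Hypothetical stand-in for a check: pass the data through untouched,
# or stop with an informative error
check_no_missing <- function(data) {
  has_na <- vapply(data, anyNA, logical(1))
  if (any(has_na)) {
    stop("The following columns contain missing values: ",
         paste0("`", names(data)[has_na], "`", collapse = ", "), ".",
         call. = FALSE)
  }
  data
}

clean <- data.frame(am = c(0, 1), disp = c(160, 108))
dirty <- clean
dirty[1, "am"] <- NA

# Clean data comes back unchanged
unchanged <- check_no_missing(clean)

# Data with a missing value triggers the error; capture its message
msg <- tryCatch(check_no_missing(dirty),
                error = function(e) conditionMessage(e))
```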
Fully leverage package structure.
For your own preparations and to contribute to the package.
The challenge: delve a little deeper into the package's inner workings.
recipes
A recipe itself is of class recipe.
All available steps and checks have their own subclass, each with its own prep and bake methods.
The recipe gathers all the objects of different subclasses.
prep.recipe and bake.recipe call the prep and bake methods of its steps and checks.
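The delegation pattern can be sketched with plain S3 classes. Everything below is a toy: the generic toy_prep and the classes toy_recipe and toy_step_center are hypothetical names chosen to avoid clashing with the package, and the real prep.recipe does considerably more bookkeeping.

```r
# Toy generic; named toy_prep to avoid clashing with recipes::prep
toy_prep <- function(x, training, ...) UseMethod("toy_prep")

# Method for one hypothetical step subclass: learn the column means
toy_prep.toy_step_center <- function(x, training, ...) {
  x$means <- colMeans(training[x$columns])
  x$trained <- TRUE
  x
}

# Method for the container: delegate to the method of every step it holds
toy_prep.toy_recipe <- function(x, training, ...) {
  x$steps <- lapply(x$steps, toy_prep, training = training)
  x
}

center_step <- structure(list(columns = c("disp", "hp"), trained = FALSE),
                         class = "toy_step_center")
toy_rec <- structure(list(steps = list(center_step)), class = "toy_recipe")

toy_rec_prepped <- toy_prep(toy_rec, training = mtcars[1:20, ])
```

Because dispatch happens on the class of each element in the list, adding a new kind of step only requires a new method, not a change to the container.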
step or check
A full step or check comprises:
prep method
bake method
print method
tidy method
recipes
My preferred way to create a new step or check:
recipes package yet.
prep.
step_<step_name> or check_<check_name> function.
prep method.
bake method.
print method.
tidy method.
In Practical Data Science with R (Zumel and Mount) the authors define the signed log as:
if |x| < 1: 0
else: sign(x) * log(|x|)
signed_log <- function(x, base = exp(1)) {
  ifelse(abs(x) < 1,
         0,
         sign(x) * log(abs(x), base = base))
}
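A quick way to see the definition at work is to apply it to values on both sides of the |x| < 1 cut-off. The definition is repeated here only so the snippet runs on its own; base 10 is chosen purely to make the numbers easy to read.

```r
signed_log <- function(x, base = exp(1)) {
  ifelse(abs(x) < 1,
         0,
         sign(x) * log(abs(x), base = base))
}

x <- c(-100, -0.5, 0, 0.5, 100)

# Values inside (-1, 1) map to 0; larger magnitudes keep their sign
result <- signed_log(x, base = 10)
```

Here result should be -2, 0, 0, 0, 2. The transformation is odd-symmetric, so signed_log(-x) equals -signed_log(x).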
We have one argument: base
Nothing has to be derived by the prep method, prep.step_signed_log.
step_signed_log_new <-
  function(terms = NULL,
           role = NA,
           skip = FALSE,
           trained = FALSE,
           base = NULL,
           columns = NULL) {
    step(
      subclass = "signed_log",
      terms = terms,
      role = role,
      skip = skip,
      trained = trained,
      base = base,
      columns = columns
    )
  }
step_signed_log <-
  function(recipe,
           ...,
           role = NA,
           skip = FALSE,
           trained = FALSE,
           base = exp(1),
           columns = NULL) {
    add_step(
      recipe,
      step_signed_log_new(
        terms = ellipse_check(...),
        role = role,
        skip = skip,
        trained = trained,
        base = base,
        columns = columns
      )
    )
  }
prep method
prep.step_signed_log <- function(x,
                                 training,
                                 info = NULL,
                                 ...) {
  col_names <- terms_select(x$terms, info = info)
  step_signed_log_new(
    terms = x$terms,
    role = x$role,
    skip = x$skip,
    trained = TRUE,
    base = x$base,
    columns = col_names
  )
}
bake method
bake.step_signed_log <- function(object,
                                 newdata,
                                 ...) {
  col_names <- object$columns
  for (i in seq_along(col_names)) {
    col <- newdata[[ col_names[i] ]]
    newdata[, col_names[i]] <-
      ifelse(abs(col) < 1,
             0,
             sign(col) * log(abs(col), base = object$base))
  }
  as_tibble(newdata)
}
print method
print.step_signed_log <-
  function(x, width = max(20, options()$width - 30), ...) {
    cat("Taking the signed log for ", sep = "")
    printer(x$columns, x$terms, x$trained, width = width)
    invisible(x)
  }
tidy method
tidy.step_signed_log <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}
df <- data_frame(x = -2:2)
recipe(df) %>%
  step_signed_log(x) %>%
  prep() %>%
  bake(df)
## # A tibble: 5 x 1
##        x
##    <dbl>
## 1 -0.693
## 2  0.
## 3  0.
## 4  0.
## 5  0.693
Within recipes you'll find a number of helper functions.
Clone the source code from https://github.com/topepo/recipes to access them.
On https://github.com/EdwinTh/recipe_for_recipes you will find a skeleton for new steps.
Slides and the skeleton can be found here:
https://github.com/EdwinTh/recipe_for_recipes
The source code for recipes is maintained here:
https://github.com/topepo/recipes/
Thorough introduction by Max Kuhn to the package:
https://www.rstudio.com/resources/webinars/creating-and-preprocessing-a-design-matrix-with-recipes/
@edwin_thoen
github.com/EdwinTh
Ensure that the range of a numeric variable in a new set is approximately equal to the range of the variable in the train set.
Throw an informative error when the new variable exceeds the original range, plus some slack, on one or both ends.
range_check_func <- function(x,
                             lower,
                             upper,
                             slack_prop = 0.05,
                             colname = "x") {
  min_x <- min(x)
  max_x <- max(x)
  # Allowed slack is a proportion of the width of the training range
  slack <- (upper - lower) * slack_prop
  # Report both ends in one message when the range is exceeded on both sides
  if (min_x < (lower - slack) && max_x > (upper + slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack,
         "\n", "max ", colname, " is ", max_x, ", upper bound is ", upper + slack,
         call. = FALSE)
  } else if (min_x < (lower - slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack,
         call. = FALSE)
  } else if (max_x > (upper + slack)) {
    stop("max ", colname, " is ", max_x, ", upper bound is ", upper + slack,
         call. = FALSE)
  }
}
slack_prop is an argument provided by the user.
lower and upper should be calculated by the prep.check_range method.
check_range_new <-
  function(terms = NULL,
           role = NA,
           trained = FALSE,
           lower = NULL,
           upper = NULL,
           slack_prop = NULL) {
    check(subclass = "range",
          terms = terms,
          role = role,
          trained = trained,
          lower = lower,
          upper = upper,
          slack_prop = slack_prop)
  }
check_range <-
  function(recipe,
           ...,
           role = NA,
           trained = FALSE,
           lower = NULL,
           upper = NULL,
           slack_prop = 0.05) {
    add_check(
      recipe,
      check_range_new(
        terms = ellipse_check(...),
        role = role,
        trained = trained,
        lower = lower,
        upper = upper,
        slack_prop = slack_prop
      )
    )
  }
prep method
prep.check_range <-
  function(x,
           training,
           info = NULL,
           ...) {
    col_names <- terms_select(x$terms, info = info)
    lower_vals <- vapply(training[ ,col_names], min, c(min = 1),
                         na.rm = TRUE)
    upper_vals <- vapply(training[ ,col_names], max, c(max = 1),
                         na.rm = TRUE)
    check_range_new(
      x$terms,
      role = x$role,
      trained = TRUE,
      lower = lower_vals,
      upper = upper_vals,
      slack_prop = x$slack_prop
    )
  }
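The bake method further on looks the bounds up by column name, so it matters that prep stores them as named vectors. The base R sketch below shows that vapply over the columns of a data frame yields exactly that. The data frame is made up for illustration, and list-style indexing (training[col_names]) is used here so the result stays a data frame even when a single column is selected from a plain data.frame.

```r
# Made-up training data with one missing value
training <- data.frame(x = c(-1, 0, 1), y = c(10, NA, 30))
col_names <- c("x", "y")

# One minimum and one maximum per selected column, ignoring missing values
lower_vals <- vapply(training[col_names], min, numeric(1), na.rm = TRUE)
upper_vals <- vapply(training[col_names], max, numeric(1), na.rm = TRUE)

# The results carry the column names, so bounds can be looked up by name
lower_y <- lower_vals[["y"]]
```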
bake method
bake.check_range <- function(object,
                             newdata,
                             ...) {
  col_names <- names(object$lower)
  for (i in seq_along(col_names)) {
    colname <- col_names[i]
    range_check_func(newdata[[ colname ]],
                     object$lower[colname],
                     object$upper[colname],
                     object$slack_prop,
                     colname)
  }
  as_tibble(newdata)
}
print method
print.check_range <-
  function(x, width = max(20, options()$width - 30), ...) {
    cat("Checking range of ", sep = "")
    printer(names(x$lower), x$terms, x$trained, width = width)
    invisible(x)
  }
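tidy method
tidy.check_range <- function(x, ...) {
  if (is_trained(x)) {
    # check_range stores no columns element; the trained column names
    # are the names of the stored bounds
    res <- tibble(terms = names(x$lower))
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}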
df1 <- data_frame(x = -1:1)
df2 <- data_frame(x = -2:2)
recipe(df1) %>%
  check_range(x) %>%
  prep() %>%
  bake(df2)
## Error: min x is -2, lower bound is -1.1
## max x is 2, upper bound is 1.1
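The bounds in this error message follow directly from the slack arithmetic in range_check_func. The short sketch below redoes that calculation by hand for the example above, with a training range of -1 to 1 and the default slack_prop of 0.05; the variable names are illustrative only.

```r
lower <- -1
upper <- 1
slack_prop <- 0.05

# Slack is 5% of the width of the training range: (1 - -1) * 0.05 = 0.1
slack <- (upper - lower) * slack_prop
bounds <- c(lower = lower - slack, upper = upper + slack)

# The new data runs from -2 to 2, so both bounds are exceeded
new_range <- range(-2:2)
exceeds <- c(lower = new_range[1] < bounds[["lower"]],
             upper = new_range[2] > bounds[["upper"]])
```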